Search Results: "drb"

30 December 2012

Iustin Pop: Interesting tool of the day: ghc-gc-tune

Courtesy of a recent Google+ post/Stack Overflow answer, I stumbled upon ghc-gc-tune, a simple but nice tool which generates interesting graphs for Haskell programs. What it does is quite trivial: iterate over a range of arena and heap sizes, run a specified program with those RTS options, then generate a graph comparing the performance (by default CPU time) across the combinations of values. Note that for newer GHC versions, you'll need to link with -rtsopts to allow the custom -A/-H sizes. The reason I mention it is that the graphs it generates can be quite interesting. For one of the Ganeti programs, hspace, run with the command line ./hspace --simu p,20,1t,96g,16,1 --disk-template drbd, it generates this graph: [runtime graph, trivial memory usage]. This picture, if I read it correctly, says that this program is actually well behaved (well, +RTS -s says "3 MB total memory in use" with default options), and that the optimum sizes actually relate to the (L1? L2?) cache size. But note that the maximum difference is only about 1.6×. By changing the parameters to hspace to make it allocate more memory (./hspace --simu p,20,1t,128g,256,32 --disk-template plain --tiered=1g,128m,1, ~52MB reported by +RTS -s), the graph changes significantly: [runtime graph, ~50MB memory usage]. Now we have somewhat the opposite situation: very small arena sizes are detrimental (and by a big factor, 4.5×), large arena/heap sizes are OK-ish, and the sweet spot is around 2-4MB arena size with heap sizes up to 4MB. Maybe these particular examples were not very enlightening (and they were definitely not well-conducted tests, etc.), but they should give some intuition into how the program behaves. Plus, the tool can also generate other plots, for example peak memory usage.
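For reference, a minimal sketch of such a run (the ghc invocation is mine, and I'm assuming ghc-gc-tune accepts the program plus its arguments directly; check ghc-gc-tune --help for the exact syntax):

# link with -rtsopts so the tool can pass -A/-H to the RTS
ghc -O2 -rtsopts --make hspace.hs -o hspace
# sweep the allocation area (-A) and heap (-H) sizes and plot the results
ghc-gc-tune ./hspace --simu p,20,1t,96g,16,1 --disk-template drbd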

26 November 2012

Russell Coker: Links November 2012

Julian Treasure gave an informative TED talk about The 4 Ways Sound Affects Us [1]. Among other things he claims that open plan offices reduce productivity by 66%! He suggests that people who work in such offices wear headphones and play bird-songs. Naked Capitalism has an interesting interview between John Cusack and Jonathan Turley about how the US government policy of killing US citizens without trial demonstrates the failure of their political system [2]. Washington's Blog has an interesting article on the economy in Iceland [3]. Allowing the insolvent banks to go bankrupt was the best thing that they have ever done for their economy. Clay Shirky wrote an insightful article about the social environment of mailing lists and ways to limit flame-wars [4]. ZRep is an interesting program that mirrors ZFS filesystems via regular snapshots and send/recv operations [5]. It seems that it could offer similar benefits to DRBD but at the file level and with greater reliability. James Lockyer gave a moving TEDx talk about his work in providing a legal defence for the wrongly convicted [6]. This has included overturning convictions after as much as half a century in which the falsely accused had already served a life sentence. Nathan Myers wrote an epic polemic about US government policy since 9-11 [7]. It's good to see that some Americans realise it's wrong. There is an insightful TED blog post about TED Fellow Salvatore Iaconesi who has brain cancer [8]. Apparently he had some problems with medical records in proprietary formats which made it difficult to get experts to properly assess his condition. Open document standards can be a matter of life and death and should be mandated by federal law. Paul Wayper wrote an interesting and amusing post about Emotional Computing which compares the strategies of Apple, MS, and the FOSS community among other things [9]. Kevin Allocca of Youtube gave an insightful TED talk about why videos go viral [10]. Jason Fried gave an interesting TED talk Why Work Doesn't Happen at Work [11]. His main issues are distraction and wasted time in meetings. He gives some good ideas for how to improve productivity. But they can also be used for sabotage. If someone doesn't like their employer then they could call for meetings, incite managers to call meetings, and book meetings so that they don't follow each other and thus waste more of the day (EG meetings at 1PM and 3PM instead of having the second meeting when the first finishes). Shyam Sankar gave an interesting TED talk about human computer cooperation [12]. He describes the success of human-computer partnerships in winning chess tournaments, protein folding, and other computational challenges. It seems that the limit for many types of computation will be the ability to get people and computers to work together efficiently. Cory Doctorow wrote an interesting and amusing article for Locus Magazine about some of the failings of modern sci-fi movies [13]. He is mainly concerned with pointless movies that get the science and technology aspects wrong and the way that the blockbuster budget process drives the development of such movies. Of course there are many other things wrong with sci-fi movies such as the fact that most of them are totally implausible (EG aliens who look like humans). The TED blog has an interesting interview with Catarina Mota about hacker spaces and open hardware [14]. Sociological Images has an interesting article about sporting behaviour [15].
They link to a very funny YouTube video of a US high school football team who make the other team believe that they aren't playing until they win [16] Related posts:
  1. Links April 2012 Karen Tse gave an interesting TED talk about how to...
  2. Links March 2012 Washington s Blog has an informative summary of recent articles about...
  3. Links November 2011 Forbes has an interesting article about crowd-sourcing by criminals and...

27 April 2012

Russell Coker: BTRFS and ZFS as Layering Violations

LWN has an interesting article comparing recent developments in the Linux world to the Unix Wars that essentially killed every proprietary Unix system [1]. The article is well worth reading; it's probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it). A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do.

The Benefits of Layers

I don't believe as strongly in the BTRFS/ZFS design as the commentator probably thinks. The current way my servers (and a huge number of other Linux systems) work, using RAID to build a reliable array from a set of cheap disks for reliability and often for capacity or performance, is a good thing. I have storage on top of the RAID array and can fix the RAID without bothering about the filesystem(s), and have done so in the past. I can also test the RAID array without involving any filesystem-specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So Linux on a laptop is much the same as Linux on a server in terms of storage once we get past the issue of whether a single disk or a RAID array is used for the LVM PV; among other things this means that the same code paths are used and I'm less likely to encounter a bug when I install a new system. LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try and fix it without interfering with other filesystems. When using layered storage I can easily add or change layers when it's appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data. When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can't separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem.

Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one device on NBD on another system. That's a bit unusual but it is apparently working well for some people. The internal RAID on ZFS and BTRFS doesn't support such things, and using software RAID underneath ZFS or BTRFS loses some features. When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than internal RAID loses reliability features for ZFS and BTRFS, no matter how you might implement those filesystems with DRBD it seems that you will lose somehow. It seems that neither BTRFS nor ZFS supports a disconnected RAID mode (like a Linux software RAID with a bitmap so it can resync only the parts that didn't change) so it's not possible to use BTRFS or ZFS RAID-1 with an NBD device.
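Coming back to the layered setup described above, a minimal sketch of that kind of stack (device names, sizes and volume names are made up for illustration):

# build a RAID-1 array from two cheap disks
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
# put LVM on top of the array, exactly as it would sit on a single disk
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -L 100G -n mail vg0
mkfs.ext4 /dev/vg0/mail
# if the filesystem gets corrupted: unmount, snapshot, experiment on the copy
umount /dev/vg0/mail
lvcreate -s -L 10G -n mail-snap /dev/vg0/mail
fsck.ext4 -fn /dev/vg0/mail-snap

Each layer can be tested or repaired on its own, which is the point being made here.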
The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.

The Benefits of Integration

When RAID and the filesystem are separate things (with some added abstraction from LVM) it's difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise its operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load, and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance. It would be possible to have a RAID driver that includes checksums for all blocks; it could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. Then to provide all the reliability benefits of ZFS you would at least need a filesystem that stores multiple copies of the data, which would of course need checksums (because the filesystem could be used on a less reliable block device), and therefore you would end up with two checksums on the same data. Note that if you want to have a RAID array with checksums on all blocks then ZFS has a volume management feature (which is well described by Mark Round) [3]. Such a zvol could be used for a block device in a virtual machine and in an ideal world it would be possible to use one as swap space. But the zvol is apparently managed with all the regular ZFS mechanisms so it's not a direct list of blocks on disk and thus can't be extracted if there is a problem with ZFS. Snapshots are an essential feature by today's standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots, to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead. The way that ZFS supports snapshots which inherit encryption keys is also interesting.

Conclusion

It's technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that puts checksums on all blocks. But it appears that there isn't much interest in developing such things. So while people would use it (and people are using ZFS zvols as block devices for other filesystems as described in a comment on Mark Round's blog) it's probably not going to be implemented. Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment. ZFS on Linux isn't a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that's not possible when you have important Linux software that needs fast access to the storage. Related posts:
  1. Starting with BTRFS Based on my investigation of RAID reliability [1] I have...
  2. ZFS vs BTRFS on Cheap Dell Servers I previously wrote about my first experiences with BTRFS [1]....
  3. Reliability of RAID ZDNet has an insightful article by Robin Harris predicting the...

8 February 2012

Russell Coker: More DRBD Performance tests

I've previously written Some Notes on DRBD [1] and a post about DRBD Benchmarking [2]. Previously I had determined that replication protocol C gives the best performance for DRBD, that the batch-time parameters for Ext4 aren't worth touching for a single IDE disk, that barrier=0 gives a massive performance boost, and that DRBD gives a significant performance hit even when the secondary is not connected. Below are the results of some more tests of delivering mail from my Postal benchmark to my LMTP server, which uses the Dovecot delivery agent to write it to disk; the rates are in messages per minute where each message is an average of 70K in size. The ext4 filesystem is used for all tests and the filesystem feature list is: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize.
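For reference, that feature list can be read straight off the device with tune2fs (the device name here is hypothetical):

tune2fs -l /dev/sda4 | grep -i 'filesystem features'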
Configuration                            p4-2.8 (messages/min)
Default Ext4                             1663
barrier=0                                2875
DRBD no secondary al-extents=7           645
DRBD no secondary default                2409
DRBD no secondary al-extents=1024        2513
DRBD no secondary al-extents=3389        2650
DRBD connected                           1575
DRBD connected al-extents=1024           1560
DRBD connected al-extents=1024 Gig-E     1544
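The al-extents values in the table would be set in the DRBD resource configuration and applied with drbdadm; a hedged sketch (the resource name is made up, and in DRBD 8.3 as shipped with Squeeze the option lives in the syncer section):

# relevant fragment of /etc/drbd.d/mailstore.res
resource mailstore {
  syncer {
    al-extents 1024;
  }
}
# apply the change to a running resource
drbdadm adjust mailstore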
The al-extents option determines the size of the dirty area that needs to be resynced when a failed node rejoins the cluster. The default is 127 extents of 4MB each, meaning up to 508MB to be synchronised. The maximum is 3389 extents, for a resynchronisation area of just over 13GB. Even with fast disks and gigabit Ethernet it's going to take a while to synchronise things if dirty zones are 13GB in size. In my tests, using the maximum al-extents size gives a 10% performance benefit in disconnected mode while a size of 1024 gives a 4% performance boost. Changing the al-extents size seems to make no significant difference for a connected DRBD device. All the tests on connected DRBD devices were done with 100baseT apart from the last one, which used a separate Gigabit Ethernet cable connecting the two systems.

Conclusions

For the level of traffic that I'm using it seems that Gigabit Ethernet provides no performance benefit; the fact that it gave a slightly lower result is not relevant as the difference is within the margin of error. Increasing the al-extents value helps with disconnected performance, a value of 1024 gives a 4% performance boost. I'm not sure that a value of 3389 is a good idea though. The ext4 barriers are disabled by DRBD, so a disconnected DRBD device gives performance that is closer to a barrier=0 mount than a regular ext4 mount. With the significant performance difference between connected and disconnected modes it seems possible that for some usage scenarios it could be useful to disable the DRBD secondary at times of peak load; it depends on whether DRBD is used as an up-to-date backup or a strict mirror.

Future Tests

I plan to do some tests of DRBD over Linux software RAID-1 and tests to compare RAID-1 with and without bitmap support. I also plan to do some tests with the BTRFS filesystem; I know it's not ready for production but it would still be nice to know what the performance is like. But I won't use the same systems, they don't have enough CPU power. In my previous tests I established that a 1.5GHz P4 isn't capable of driving the 20G IDE disk to its maximum capacity and I'm not sure that the 2.8GHz P4 is capable of running a RAID to its capacity. So I will use a dual-core 64bit system with a pair of SATA disks for future tests. The difference in performance between 20G IDE disks and 160G SATA disks should be a lot less than the performance difference between a 2.8GHz P4 and a dual-core 64bit CPU. Related posts:
  1. DRBD Benchmarking I ve got some performance problems with a mail server that s...
  2. Some Notes on DRBD DRBD is a system for replicating a block device across...
  3. Ethernet bonding Bonding is one of the terms used to describe multiple...

5 February 2012

Russell Coker: Reliability of RAID

ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]. Basically, as drives get larger the probability of hitting a read error during reconstruction increases, and therefore you need more redundancy to deal with this. He suggests that as of 2009 drives were too big for a reasonable person to rely on correct reads from all remaining drives after one drive failed (in the case of RAID-5) and that in 2019 there will be a similar issue with RAID-6. Of course most systems in the field aren't using even RAID-6. All the most economical hosting options involve just RAID-1, and RAID-5 is still fairly popular with small servers. With RAID-1 and RAID-5 you have a serious problem when (not if) a disk returns random or outdated data and says that it is correct: you have no way of knowing which of the disks in the set has good data and which has bad data. For RAID-5 it will be theoretically possible to reconstruct the data in some situations by determining which disk should have its data discarded to give a result that passes higher level checks (EG fsck or application data consistency), but this is probably only viable in extreme cases (EG one disk returns only corrupt data for all reads). For the common case of a RAID-1 array, if one disk returns a few bad sectors then probably most people will just hope that it doesn't hit something important. The case of Linux software RAID-1 is of interest to me because that is used by many of my servers. Robin has also written about some NetApp research into the incidence of read errors which indicates that 8.5% of consumer disks had such errors during the 32 month study period [2]. This is a concern as I run enough RAID-1 systems with consumer disks that it is very improbable that I'm not getting such errors. So the question is, how can I discover such errors and fix them? In Debian the mdadm package does a monthly scan of all software RAID devices to try and find such inconsistencies, but it doesn't send an email to alert the sysadmin! I have filed Debian bug #658701 with a patch to make mdadm send email about this. But this really isn't going to help a lot as the email will be sent AFTER the kernel has synchronised the data, with a 50% chance of overwriting the last copy of good data with the bad data! Also the kernel code doesn't seem to tell userspace which disk had the wrong data in a 3-disk mirror (and presumably a RAID-6 works in the same way) so even if the data can be corrected I won't know which disk is failing. Another problem with RAID checking is the fact that it will inherently take a long time and in practice can take a lot longer than necessary. For example I run some systems with LVM on RAID-1 on which only a fraction of the VG capacity is used; in one case the kernel will check 2.7TB of RAID even when there's only 470G in use!

The BTRFS Filesystem

The btrfs Wiki is currently at btrfs.ipv5.de as the kernel.org wikis are apparently still read-only since the compromise [3]. BTRFS is noteworthy for doing checksums on data and metadata and for having internal support for RAID. So if two disks in a BTRFS RAID-1 disagree then the one with valid checksums will be taken as correct! I've just done a quick test of this. I created a filesystem with the command mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid? and copied /dev/urandom to it until it was full.
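That test setup as a minimal sketch (the post only names /dev/vg0/raidb explicitly, so the first LV name and the mount point are my guesses):

# two LVs, with both metadata and data mirrored across them
mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raida /dev/vg0/raidb
mount /dev/vg0/raida /mnt/btrfs-test
# fill the filesystem with random data until it runs out of space
dd if=/dev/urandom of=/mnt/btrfs-test/random.dat bs=1M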
I then used dd to copy /dev/urandom to some parts of /dev/vg0/raidb while reading files from the mounted filesystem. That worked correctly, although I was disappointed that it didn't report any errors; I had hoped that it would read half the data from each device and fix some errors on the fly. Then I ran the command btrfs scrub start . and it gave lots of verbose errors in the kernel message log telling me which device had errors and where the errors are. I was a little disappointed that the command btrfs scrub status . just gave me a count of the corrected errors and didn't mention which device had the errors. It seems to me that BTRFS is going to be a much better option than Linux software RAID once it is stable enough to use in production. I am considering upgrading one of my less important servers to Debian/Unstable to test out BTRFS in this configuration. BTRFS is rumored to have performance problems; I will test this but don't have time to do so right now. Anyway I'm not always particularly concerned about performance, I have some systems where reliability is important enough to justify a performance loss.

BTRFS and Xen

The system with the 2.7TB RAID-1 is a Xen server and LVM volumes on that RAID are used for the block devices of the Xen DomUs. It seems obvious that I could create a single BTRFS filesystem for such a machine that uses both disks in a RAID-1 configuration and then use files on the BTRFS filesystem for Xen block devices. But that would give a lot of overhead from having a filesystem within a filesystem. So I am considering using two LVM volume groups, one for each disk. Then for each DomU which does anything disk intensive I can export two LVs, one from each physical disk, and then run BTRFS inside the DomU. The down-side of this is that each DomU will need to scrub the devices and monitor the kernel log for checksum errors. Among other things I will have to back-port the BTRFS tools to CentOS 4. This will be more difficult to manage than just having an LVM VG running on a RAID-1 array and giving each DomU a couple of LVs for storage.

BTRFS and DRBD

The combination of BTRFS RAID-1 and DRBD is going to be a difficult one. The obvious way of doing it would be to run DRBD over loopback devices that use large files on a BTRFS filesystem. That gives the overhead of a filesystem in a filesystem as well as the DRBD overhead. It would be nice if BTRFS supported more than two copies of mirrored data. Then instead of DRBD over RAID-1 I could have two servers that each have two devices exported via NBD and BTRFS could store the data on all four devices. With that configuration I could lose an entire server and get a read error without losing any data!

Comparing Risks

I don't want to use BTRFS in production now because of the risk of bugs. While it's unlikely to have really serious bugs it's theoretically possible that a bug could deny access to data until kernel code is fixed, and it's also possible (although less likely) that a bug could result in data being overwritten such that it can never be recovered. But for the current configuration (Ext4 on Linux software RAID-1) it's almost certain that I will lose small amounts of data and it's most probable that I have silently lost data on many occasions without realising. Related posts:
  1. Some RAID Issues I just read an interesting paper titled An Analysis of...
  2. ECC RAM is more useful than RAID A common myth in the computer industry seems to be...
  3. Software vs Hardware RAID Should you use software or hardware RAID? Many people claim...

5 January 2012

Russell Coker: DRBD Benchmarking

I've got some performance problems with a mail server that's using DRBD so I've done some benchmark tests to try and improve things. I used Postal for testing delivery to an LMTP server [1]. The version of Postal I released a few days ago had a bug that made LMTP not work; I'll release a new version to fix that next time I work on Postal or when someone sends me a request for LMTP support (so far no-one has asked for LMTP support so I presume that most users don't mind that it's not yet working). The local spool on my test server is managed by Dovecot: the Dovecot delivery agent stores the mail and the Dovecot POP and IMAP servers provide user access. For delivery I'm using the LMTP server I wrote, which has been almost ready for GPL release for a couple of years. All I need to write is a command-line parser to support delivery options for different local delivery agents. Currently my LMTP server is hard-coded to run /usr/lib/dovecot/deliver and has its parameters hard-coded too. As an aside, if someone would like to contribute some GPL C/C++ code to convert a string like /usr/lib/dovecot/deliver -e -f %from% -d %to% -n into something that will populate an argv array for execvp() then that would be really appreciated. Authentication is to a MySQL server running on a fast P4 system. The MySQL server never came close to its CPU or disk IO capacity so using a different authentication system probably wouldn't have given different results. I used MySQL because it's what I'm using in production. Apart from my LMTP server and the new version of Postal, all software involved in the testing is from Debian/Squeeze.

The Tests

All tests were done on a 20G IDE disk. I started testing with a Pentium-4 1.5GHz system with 768M of RAM but then moved to a Pentium-4 2.8GHz system with 1G of RAM when I found CPU time to be a bottleneck with barrier=0. All test results are the average number of messages delivered per minute for a 19 minute test run where the first minute's results are discarded. The delivery process used 12 threads to deliver mail.
Configuration                  P4-1.5   P4-2.8
Default Ext4                   1468     1663
Ext4 max_batch_time=30000      1385     1656
Ext4 barrier=0                 1997     2875
Ext4 on DRBD no secondary      1810     2409
When doing the above tests the 1.5GHz system was using 100% CPU time when the filesystem was mounted with barrier=0; about half of that was system time (although I didn't make notes at the time). So the testing on the 1.5GHz system showed that increasing the Ext4 max_batch_time number doesn't give a benefit for a single disk, that mounting with barrier=0 gives a significant performance benefit, and that using DRBD in disconnected mode gives a good performance benefit through forcing barrier=0. As an aside, I wonder why they didn't support barriers on DRBD given all the other features that they have for preserving data integrity. The tests with the 2.8GHz system demonstrate the performance benefits of having adequate CPU power. As an aside, I hope that Ext4 is optimised for multi-core CPUs, because if a 20G IDE disk needs a 2.8GHz P4 then modern RAID arrays probably require more CPU power than a single core can provide. It's also interesting to note that a degraded DRBD device (where the secondary has never been enabled) only gives 84% of the performance of /dev/sda4 when mounted with barrier=0.
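For reference, the options compared in these tests are ordinary ext4 mount options; they would be applied along these lines (the device and mount point are hypothetical):

mount -o barrier=0 /dev/sda4 /mail
mount -o max_batch_time=30000 /dev/sda4 /mail
# or on an already-mounted filesystem
mount -o remount,barrier=0 /mail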
Configuration                                    p4-2.8 (messages/min)
Default Ext4                                     1663
Ext4 max_batch_time=30000                        1656
Ext4 min_batch_time=15000,max_batch_time=30000   1626
Ext4 max_batch_time=0                            1625
Ext4 barrier=0                                   2875
Ext4 on DRBD no secondary                        2409
Ext4 on DRBD connected C                         1575
Ext4 on DRBD connected B                         1428
Ext4 on DRBD connected A                         1284
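The A/B/C rows above refer to the DRBD replication protocol, which is selected in the resource definition; a hedged sketch (the resource name is invented and the per-host sections are omitted for brevity):

resource mailstore {
  protocol C;   # the setting compared in the rows above: C, B or A
  # on <host> { device ...; disk ...; address ...; } sections go here
}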
Of all the options for batch times that I tried, it seemed that every change decreased the performance slightly, but as the greatest decrease in performance was only slightly more than 2% it doesn't matter much. One thing that really surprised me was the test results from the different replication protocols. The DRBD replication protocols are documented here [2]. Protocol C is fully synchronous: a write request doesn't complete until the remote node has it on disk. Protocol B is memory synchronous: the write is complete when it's on a local disk and in RAM on the other node. Protocol A is fully asynchronous: a write is complete when it's on a local disk. I had expected protocol A to give the best performance as it has lower latency for critical write operations, and protocol C to be the worst; the results were the other way around. My theory is that DRBD has a performance bug for the protocols that the developers don't recommend. One other thing I can't explain is that according to iostat the data partition on the secondary DRBD node had almost 1% more sectors written than the primary, and the number of writes was more than 1% greater on the secondary. I had hoped that with protocol A the writes would be combined on the secondary node to give a lower disk IO load. I filed Debian bug report #654206 about the kernel not exposing the correct value for max_batch_time. The fact that no-one else has reported that bug (which is in kernels from at least 2.6.32 to 3.1.0) is an indication that not many people have found the option useful.

Conclusions

When using DRBD use protocol C as it gives better integrity and better performance. Significant CPU power is apparently required for modern filesystems; the fact that a Maxtor 20G 7200rpm IDE disk [3] can't be driven at full speed by a 1.5GHz P4 was a surprise to me. DRBD significantly reduces performance when compared to a plain disk mounted with barrier=0 (for a fair comparison). The best that DRBD could do in my tests was 55% of native performance when connected and 84% of native performance when disconnected. When comparing a cluster of cheap machines running DRBD on RAID-1 arrays to a single system running RAID-6 with redundant PSUs etc, the performance loss from DRBD is a serious problem that can push the economic benefit back towards the single system. Next I will benchmark DRBD on RAID-1 and test the performance hit of using bitmaps with Linux software RAID-1. If anyone knows how to make an HTML table look good then please let me know. It seems that the new blog theme that I'm using prevents borders.

Update: I mentioned my Debian bug report about the mount option and the fact that it's all on Debian/Squeeze. Related posts:
  1. I need an LMTP server I am working on a system where a front-end mail...
  2. Some Notes on DRBD DRBD is a system for replicating a block device across...
  3. paper about ZCAV This paper by Rodney Van Meter about ZCAV (Zoned Constant...

17 December 2011

Russell Coker: Some Notes on DRBD

DRBD is a system for replicating a block device across multiple systems. It's most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it's committed to disk locally and remotely. There is support for having multiple systems write to disk at the same time, but naturally that only works if the filesystem drivers are aware of this. I'm installing DRBD on some Debian/Squeeze servers for the purpose of mirroring a mail store across multiple systems. For the virtual machines which run mail queues I'm not using DRBD because the failure conditions that I'm planning for don't include two disks entirely failing. I'm planning for a system having an outage for a while, so it's OK to have some inbound and outbound mail delayed but it's not OK for the mail store to be unavailable.

Global changes I've made in /etc/drbd.d/global_common.conf

In the common section I changed the protocol from C to B; this means that a write() system call returns after data is committed locally and sent to the other node. This means that if the primary node goes permanently offline AND the secondary node has a transient power failure or kernel crash causing the buffer contents to be lost, then writes can be lost. I don't think that this scenario is likely enough to make it worth choosing protocol C and requiring that all writes go to disk on both nodes before they are considered to be complete. In the net section I added the following:

sndbuf-size 512k;
data-integrity-alg sha1;

This uses a larger network sending buffer (apparently good for fast local networks, although I'd have expected that the low delay on a local Gig-E would give a low bandwidth delay product) and uses sha1 hashes on all packets (why does it default to no data integrity checking?).

Reserved Numbers

The default port number that is used is 7789. I think it's best to use ports below 1024 for system services so I've set up some systems starting with port 100 and going up from there. I use a different port for every DRBD instance, so if I have two clustered resources on a LAN then I'll use different ports even if they aren't configured to ever run on the same system. You never know when the cluster assignment will change, and DRBD port numbers seem like something that could potentially cause real problems if there was a port conflict. Most of the documentation assumes that the DRBD device nodes on a system will start at /dev/drbd0 and increment, but this is not a requirement. I am configuring things such that there will only ever be one /dev/drbd0 on a network. This means that there is no possibility of a cut/paste error in an /etc/fstab file or a Xen configuration file causing data loss. As an aside, I recently discovered that a Xen Dom0 can do a read-write mount of a block device that is being used read-write by a Xen DomU; there is some degree of protection against a DomU using a block device that is already being used in the Dom0, but no protection against the Dom0 messing with the DomU's resources. It would be nice if there was an option of using some device name other than /dev/drbdX where X is a number. Using meaningful names would reduce the incidence of doing things to the wrong device. As an aside, it would be nice if there was some sort of mount helper for determining which devices shouldn't be mounted locally and which mount options are permitted; it MIGHT be OK to do a read-only mount of a DomU's filesystem in the Dom0 but probably all mounting should be prevented. Also a mount helper for such things would ideally be able to change the default mount options, for example it could make the defaults be nosuid,nodev (or even noexec,nodev) when mounting filesystems from removable devices.

Initial Synchronisation

After a few trials it seems to me that things generally work if you create DRBD on two nodes at the same time and then immediately make one of them primary. If you don't then it will probably refuse to accept one copy of the data as primary, as it can't seem to realise that both are inconsistent. I can't understand why it does this in the case where there are two nodes with inconsistent data; you know for sure that there is no good data so there should be an operation to zero both devices and make them equal. Instead, the solution sometimes seems to be to run drbdsetup /dev/drbd0 primary - (where drbd0 is replaced with the appropriate device). This seems to work well and allowed me to create a DRBD installation before I had installed the second server. If the servers have been connected in Inconsistent/Inconsistent state then the solution seems to involve running drbdadm -- --overwrite-data-of-peer primary db0-mysql (for the case of a resource named db0-mysql defined in /etc/drbd.d/db0-mysql.res). Also it seems that some commands can only be run from one node.
So if you have a primary node that's in service and another node in Secondary/Unknown state (IE disconnected) with data state Inconsistent/DUnknown then, while you would expect to be able to connect from the secondary node, it appears that nothing other than a drbdadm connect command run from the primary node will get things going.
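Putting those notes together, a hedged sketch of what such a configuration might look like (hostnames, IP addresses, disks and the resource name are invented; the syntax follows the DRBD 8.3 style used on Squeeze):

# /etc/drbd.d/global_common.conf (relevant parts)
common {
  protocol B;
  net {
    sndbuf-size 512k;
    data-integrity-alg sha1;
  }
}

# /etc/drbd.d/mailstore.res
resource mailstore {
  device    /dev/drbd0;
  meta-disk internal;
  on mail1 {
    disk    /dev/vg0/mailstore;
    address 192.0.2.1:100;   # a port below 1024, unique per resource
  }
  on mail2 {
    disk    /dev/vg0/mailstore;
    address 192.0.2.2:100;
  }
}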

8 September 2011

Russell Coker: Moving from a Laptop to a Cloud Lifestyle

My Laptop History

In 1998 I bought my first laptop, a Thinkpad 385XD with a PentiumMMX 233MHz CPU, 96M of RAM, and an 800*600 display. This was less RAM than I could have afforded in a desktop system and the 800*600 display didn't compare well to the 1280*1024 resolution 17 inch Trinitron monitor I had been using. Having only 1/3 the pixels is a significant loss, and a 12.1 inch TFT display of that era compared very poorly with a good Trinitron monitor. In spite of this I found it a much better system to use because it was ALWAYS with me; I used it for many things that were probably better suited to a PDA (there probably aren't many people who have carried a 7.1 pound (3.2Kg) laptop to as many places as I did). Some of my best coding was done on public transport. But I didn't buy my first laptop for that purpose, I bought it because I was moving to another country and there just wasn't any other option for having a computer. In late 1999 I bought my second laptop, a Thinkpad 600E [1]. It had twice the CPU speed, twice the RAM, and a 1024*768 display that displayed color a lot better. Since then I have had another three Thinkpads, a T21, a T43, and now a T61. One of the ways I measure a display is the number of 80*25 terminal windows that I can display at one time: my first Thinkpad could display four windows with a significant amount of overlap. My second could display four with little overlap, my third (with 1280*1024 resolution) could display four clearly and another two with overlap, and my current Thinkpad does 1680*1050 and can display four windows clearly and another five without excessive overlap. For most of the last 13 years my Thinkpads weren't that far behind what I could afford to get as a desktop system, until now.

A Smart Phone as the Primary Computing Device

For the past 6 months the Linux system I've used most frequently is my Sony Ericsson Xperia X10 Android phone [2]. Most of my computer use is on my laptop, but the many short periods of time using my phone add up. This has forced some changes to the way I work. I now use IMAP instead of POP for receiving mail so I can use my phone and my laptop with the same mail spool. This is a significant benefit for my email productivity: instead of having 100 new mailing list messages waiting for me when I get home I can read them on my phone and then have maybe 1 message that can't be addressed without access to something better than a phone. My backlog of 10,000 unread mailing list messages lasted less than a month after getting an Android phone! A few years ago I got an EeePC 701 that I use for emergency net access when a server goes down. But even a 920g EeePC is more weight than I want to carry; as I need to have a mobile phone anyway there is effectively no extra mass or space used to have a phone capable of running an ssh client. My EeePC doesn't get much use nowadays.

A Cheap 27 inch Monitor from Dell

Dell Australia is currently selling a 27 inch monitor that does 2560*1440 (WQHD) for $899AU. Dell Australia offers a motor club discount which pretty much everyone in Australia can get, as almost everyone is either a member of such a club or knows a member well enough to use their membership number for the discount. This discount reduces the price to $764.15. The availability of such a great cheap monitor has caused me to change my working habits. It doesn't make sense to have a reasonably powerful laptop used in one location almost all the time when a desktop system with a much better monitor can be used.

The Plan

Now that my 27 inch monitor has arrived I have to figure out a way of making things work. I still need to work from a laptop on occasion but my main computer use is going to be a smart-phone and a desktop system. Email is already sorted out: I already have three IMAP client systems (netbook, laptop, and phone), and adding a desktop system as a fourth isn't going to change anything. The next issue is software development. In the past I haven't used version control systems that much for my hobby work, I have just released a new version every time I had some significant changes. Obviously to support development on two or three systems I need to use a VCS rigorously. I'm currently considering Subversion and Git. Subversion is really easy to use (for me), but it seems to be losing popularity. Git is really popular, so if I use it for my own projects then I could allow anonymous access for anyone who's interested; maybe that will encourage more people to contribute (a minimal sketch of such a setup is at the end of this post). One thing I haven't even investigated yet is how to manage my web browsing work-flow in a distributed manner. My pattern when using a laptop is to have many windows and tabs open at the same time for issues that I am researching and to only close them days or weeks later when I have finished with the issue. For example if I'm buying some new computer gear I will typically open a web browser window with multiple tabs related to the equipment (hardware, software, prices, etc) and keep them all open until I have received it and got it working. Chromium, Mozilla, and presumably other modern web browsers have a facility to reopen windows after a crash. It would be ideal for me if there was some sort of similar facility that allowed me to open the windows that are open on another system and to push window-open commands to another system. For example when doing web browsing on my phone I would like to be able to push the URLs of pages that can't be viewed on a phone to my desktop system and have them open waiting for me when I get home. It would be nice if web browsing could be conceptually similar to a remote desktop service in terms of what the user sees. Finally, in my home directory there are lots of random files. Probably about half of them could be deleted if I was more organised (disk space is cheap and most of the files are small). For the rest it would be good if they could be accessed from other locations. I have read about people putting the majority of their home directory under version control, but I'm not sure that would work well for me. It would be good if I could do something similar with editor sessions: if I had a file open in vi on my desktop before I left home it would be good if I could get a session on my laptop to open the same file (well, the same named file checked out of the VCS).

Configuring the Desktop System

One of the disadvantages of a laptop is that RAID usually isn't viable. With a desktop system software RAID-1 is easy to configure but it results in two disks making heat and noise. For my new desktop system I'm thinking of using a DRBD device for /home to store the data locally as well as almost instantly copying it to RAID-1 storage on the server. The main advantage of DRBD over NFS, NBD, and iSCSI is that I can keep working if the server becomes unavailable (EG use the desktop system to ask Google how to fix a server fault). Also with DRBD it's a configuration option to allow synchronous writes to return after the data is written locally, which is handy if the server is congested. Another option that I'm considering is a diskless system using NBD or iSCSI for all storage. This will prevent using swap (you can't swap to a network device, to avoid deadlocks) but that won't necessarily be a problem given the decrease in RAM prices, as I can just buy enough RAM to not need swap.

The Future

Eventually I want to be able to use a tablet for almost everything including software development. While a tablet display isn't going to be great for coding I'm sure that I can make use of enough otherwise wasted time to justify the expense. I will probably need a tablet that acts like a regular Linux computer, not an Android tablet.
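Going back to the version control question above, a minimal sketch of the kind of shared Git setup being considered (the hostname and paths are invented):

# on the server: a bare repository that every machine pushes to and pulls from
git init --bare /srv/git/project.git
# on the laptop, desktop, etc.
git clone ssh://server.example.com/srv/git/project.git
cd project
# hack, commit, then sync
git commit -a -m 'work done on the train'
git push origin master
# and on the next machine
git pull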

16 July 2011

Matthew Palmer: MySQL replication and crash recovery

The question was recently asked, How do I perform crash recovery on a MySQL master/slave database cluster? Because replication is async, a transaction that has been committed to the master may not be able to leave the master before the crash happens. The short answer is: you don't. Asynchronous replication, by its very nature, is prone to these sorts of issues, and MySQL doesn't make life easier in general by using query replay. Some of the issues aren't massive: for instance, if a slave crashes, the master will keep replication logs for the slave until it comes back up (as long as the slave comes back before the master decides some of the replogs are too old and starts deleting them, of course), so as long as you can live without a master for a while, you can recover and continue along your merry way. Promoting a slave to be a master, on the other hand, is a situation fraught with peril. Here, you've just got to hope that your chosen slave is consistent with the master's state at death (because it's async, you have no guarantees about that), and that all the other slaves have the same ideas about what counts as the current state of the data. If your newly-promoted slave managed to apply an update that another slave didn't, that slave will be out of sync with reality (until such time as the now-dead master comes back and replays all those queries hahahaha). To even guarantee that any slaves have a consistent view of the data as compared to the new master, you've got to re-replicate everything from the new master, because MySQL's "where are you up to?" counter is master-specific. I've heard people who have to deal with this sort of thing these days say that the Maatkit tools are handy for dealing with various issues of this type (monitoring, repairing the database when it gets inconsistent). However, I prefer to keep right the hell away from MySQL replication altogether, after a year of having to gently coax an insane MySQL tree replication setup to do the right thing for any extended period, and watching a group of DBAs at Engine Yard go slowly insane. My preferred alternatives are: Never discount the value of the Jaws approach to scaling ("we're gonna need a bigger boat^Wmachine") hardware is so ridiculously cheap, relative to the clueful sysadmin time needed to make (and keep) MySQL replication running, that you really need to be at the huge end of town (where other scaling issues are going to bite you in the arse first) before spending too much time on your database is worth the investment. Even then, you'll get more bang for your buck from: than you ever will from the constant, mind-numbing grind that is keeping MySQL replication going.
  1. The term comes from a time when a colleague near-screamed This slave cluster keeps fucking me! at a particularly intransigent set of MySQL slaves.
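For what it's worth, the usual way to judge how far apart the boxes are before promoting anything is to compare binary log coordinates on each of them; a minimal sketch (credentials omitted, field names as in MySQL 5.x):

# on the master, if it is still reachable
mysql -e 'SHOW MASTER STATUS'
# on each slave: compare Master_Log_File / Exec_Master_Log_Pos
# and keep an eye on Seconds_Behind_Master
mysql -e 'SHOW SLAVE STATUS\G'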

11 June 2010

Norbert Tretkowski: DRBD source dropped from Debian unstable

OK, not really... DRBD 8.3.7 has been an official part of the Linux kernel since 2.6.33, and thanks to Dann Frazier we also have a backport of that patch in Debian's 2.6.32 kernel.

Until yesterday, the DRBD source package in Debian built a -utils and a -source package; the latter has now been dropped. From now on, users no longer need to recompile the DRBD module after a kernel upgrade.
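In practice, on a kernel that already ships the module, getting DRBD going should just be a matter of installing the userland tools and loading the module; a hedged sketch (drbd8-utils is the package name I'd expect on Squeeze):

aptitude install drbd8-utils
modprobe drbd
cat /proc/drbd    # shows the module version once it is loaded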

9 May 2010

Gregor Herrmann: RC bugs 2010/17, 2010/18

The usual short overview of my RC bug activities: luckily the count of RC bugs is going down, but some more people tackling the remaining ones would be great!

6 December 2009

Stefano Zacchiroli: RC bugs of the week - week 13

RCBW - week #13 Here are this week's squashes, by yours truly: And here are the usual highlights:

26 September 2009

Stefano Zacchiroli: RC bugs of the week - week 4

RCBW - week #4 RC bugs squashed this week by yours truly: You're welcome to join the initiative! In doing so, please tag the blog post as "debian" and "rc".

15 February 2009

Michael Prokop: Debian GNU/Linux 5.0 codename Lenny - News for sysadmins

Alright, Debian GNU/Linux 5.0, AKA Lenny, has been released. Time for a Debian unstable unfreeze party! 8-) What does the new stable release bring for system administrators? I'll give an overview of the news you might expect when upgrading from Debian GNU/Linux 4.0, codename Etch (released on 8th April 2007) to the current version Debian GNU/Linux 5.0, codename Lenny (released on 14th February 2009). I try to avoid duplicating information, so make sure to read the release announcement and the official release notes for Lenny beforehand.

Noteworthy Changes

Virtualisation

Virtualisation-related new tools: Desktop oriented packages like virtualbox and qemu are available as well, of course.

Noteworthy Updates

This is a (selective) list of some noteworthy updates:

New packages

Lenny ships over 7000 new packages. Lists of new/removed/replaced packages are available online. I'll name 238 sysadmin-related packages that might be worth a look. (Note: I don't list add-on stuff like optional server modules, docs-only and kernel-source related packages. I plan to present some of the following packages in more detail in separate blog entries.)

Further Resources

1 February 2009

Andre Luis Lopes: KVMing for fun and profit : how to make your toy become a serious business

Here I'll try to present some tips on how one could easily make a playground become paid fun. And no, I won't be teaching you how to become rich, although I would like it if you would teach me how that could be accomplished, as I've been trying to come up with a way to do just that for a couple of years now and failed miserably. Recently, I've been setting up KVM virtual machines in order to do a lot of testing for a project I'm involved with. Leaving boring details aside, one could say that a very long, boring and error prone setup was finished after a lot of work and it worked like a charm in the end, as a KVM machine. As always, laziness is a given, and when you are into it you empower yourself to come up with some nice hacks and go that extra mile to find out how not to have to do boring and repetitive things again and again, in the best spirit of "let's have some work now and save me ten times more work later". As I'm becoming more and more into that spirit of laziness, there I went looking for a way which would let me transform my beloved and nicely working KVM machine into a fully featured real physical server. After some tips from microblogosphere friends (thanks fike), I had some ideas on how that could be accomplished. Turned out my ideas weren't really all that right, so there I went again looking for a way to do what I wanted. After some research I found out that the KVM guys had already given some thought to it and even had a ready solution to my problem: it's called qemu-nbd. Yes, by now everyone should know that KVM builds on top of the excellent (just not so snappy) QEMU. However, as I'm a Debian user, here I should say that in Debian, most notably Debian sid/unstable, the qemu-nbd binary was renamed to kvm-nbd, just as the KVM binary is called only kvm and not qemu-system-x86_64 or whatever else it is called these days upstream. NBD (short for Network Block Device) is a nice thing. I hadn't tried it before (I certainly used and am still using in some projects of mine the not-upstreamed-yet-but-so-nice-that-I-couldn't-resist DRBD, which seems to share some ideas with NBD). In short (read the NBD link for more info on that), NBD allows the Linux kernel to use a remote server as a block device. Also, its upstream is a fellow Debian developer (hello Wouter). How nice is that? As NBD is an upstreamed kernel feature, you don't need to go down the external-module-route hell. You only need the userland tools, which are nicely packaged and integrated into Debian as well. They're only one aptitude run away, so you could use this command to bring them to your KVM host:
aptitude install nbd-client
After that, you will want to export your image file (the file which represents the disk of your virtual machine) as an NBD block device, so you will be able to mount it under your KVM host and do whatever you want to do with it. Here's how one would do it:
kvm-nbd --connect=/dev/nbd0 myimagefile.img
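If /dev/nbd0 doesn't exist yet, the nbd kernel module has to be loaded first; the max_part parameter is my assumption here, it is commonly needed so that partition nodes such as /dev/nbd0p1 show up:

modprobe nbd max_part=8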
I tested it using both qcow2 and raw disk images and both worked like a charm, so I don't think you will have any problems here. Next, you need to create a directory structure under your KVM host which will temporarily be used to mount the partitions inside the virtual block device of your KVM machine. You could do it like this:
mkdir /newserver
mkdir /newserver/usr
mkdir /newserver/var
mkdir /newserver/home
In the example above, I've created these directories because I want my new server to have separate root, /usr, /var and /home partitions and mount points. Feel free to adapt this to any kind of partitioning layout you please. Next, you will mount your virtual machine's filesystem under a temporary location, so you will be able to copy all of its content later to your final desired destination, i.e. the real partitions inside the real disk you are going to use in your future real server (not a virtual machine anymore). I did it using:
mkdir -p /mnt/temp
mount /dev/nbd0p1 /mnt/temp
Notice that the /dev/nbd0p1 above is my root partition inside my KVM machine, only that it's now represented in my KVM host using NBD's device node notation. I've only one partition inside this particular virtual machine (aside from the swap partition). Next, you want to mount the partitions of the real disk you want to use in your new real server under the directories created earlier, so you will be able to chroot into them later and do any tweaks you may want to, like fixing /etc/fstab, /boot/grub/device.map, /boot/grub/menu.lst and all those places we know make references to block devices which may differ from the situation you had in your virtual machine running under KVM. Here's what I used:
mount /dev/sdb1 /newserver
mount /dev/sdb3 /newserver/usr
mount /dev/sdb5 /newserver/var
mount /dev/sdb6 /newserver/home
The device nodes above are from your real disk's partitions, not from any virtualised block device access method like NBD. Also, take care not to mount your primary disk's partitions (the ones you are using in your KVM host) as you could destroy them easily. As you can see above, I'm using /dev/sdbX as /dev/sda is my primary disk, the one under which my KVM host is running. Again, adapt this to your scenario. Next, you just copy everything from your virtual machine filesystem (which is now accessible from within your NBD block device) to your real disk. I did it using:
cp -a /mnt/temp/* /newserver/
After that, you can disconnect your KVM virtual machine image file from the exported NBD block device using:
kvm-nbd -d /dev/nbd0
And then chroot into the real disk layout you just copied to /newserver in order to fix anything which made reference to block devices while running as a virtual machine under KVM but which will now be a completely different thing on a real server, like your /etc/fstab, /boot/grub/device.map, /boot/grub/menu.lst and so on. In order to chroot into your new soon-to-be-real filesystem you could use:
chroot /newserver
Then go nuts and start fixing everything you think needs to be fixed so this filesystem can be used to boot a real server. In my case, I needed to fix /etc/fstab in order to mount my additional partitions (as under KVM I was using only a root and a swap partition and now, on the real server, I'll be using separate root, swap, /usr, /var and /home). Before I went changing more things, I fixed my /boot/grub/device.map and my /boot/grub/menu.lst files to point to /dev/sdb and /dev/sdb[1], respectively, so I could run grub-install from inside the chroot in order to install GRUB into the new real disk's MBR (master boot record), using:
grub-install /dev/sdb
Then, finally, I fixed /boot/grub/device.map and /boot/grub/menu.lst again so they would point to /dev/sda and /dev/sda[1], as /dev/sda will be the device node for this new disk when it is running in the new real server and not as a disk inside the KVM host machine. Next, I exited the chroot using:
exit
And unmounted the new real disk's partitions, using (now out of the chroot):
umount /newserver/home
umount /newserver/var
umount /newserver/usr
umount /newserver
Now you can take this new disk out of your KVM host machine, put it inside your new real server, physically install it there, turn on the new server, and everything that was installed while the filesystem was being used under the KVM virtual machine will be there, running nicely. And you're done. That's it. Also, it's important to point out that it's much easier done than said (or written, in this case) so, if it looks scary because of the size of this post, don't let that discourage you, as this whole thing is a pretty straightforward procedure, much easier than it seems. And as I know that I will receive lots of complaints from people who will tell me how inefficient this whole thing is and how I could have accomplished the exact same thing in a much easier and faster way (perhaps even with an already existing tool which would automate almost the entire process), I would say to these people: please exercise your right to share ideas using the comments. Just be nice to me, please. After all, it was fun and I learned new things while doing it all, and I tried to share what I learned from this experience with my peers. That's what matters. Peace, love and geekness to all of you :-)

27 August 2008

Ingo Juergensmann: re: Killing Servers with Virtualisation and Swap

Russell Coker blogs about a problem that concerns me as well: Killing Servers with Virtualisation and Swap, i.e. what happens when one domU is happily swapping the whole day?

Luckily this hasn't happened to me yet, but like Russell, I also tend to give domUs only a small swap area. If they're swapping extensively, something is wrong: most probably a task has a memory leak, or the virtual machine is undersized on RAM.

Anyway, Russell had some thoughts and ideas on how to deal with the problem, such as giving some domUs dedicated swap areas on single disks to separate the disk I/O for the filesystem from the I/O for swap. He proceeds with:

Now if you have a server with 8 or 12 disks (both of which seem to be reasonably common capacities of modern 2RU servers) and if you decide that RAID is not required for the swap space of DomUs then it would be possible to assign single disks for swap spaces for groups of virtual machines. So if one client had several virtual machines they could have them share the same single disk for the swap, so a thrashing server would only affect the performance of other VMs from the same client. One possible configuration would be a 12 disk server that has a four disk RAID-5 array for main storage and 8 single disks for swap. 8 CPU cores is common for a modern 2RU server, so it would be possible to lock 8 groups of DomUs so that they share CPUs and swap spaces. Another possibility would be to have four groups of DomUs where each group had a RAID-1 array for swap and two CPU cores.


I myself would consider a different approach:


Of course you can mix both approaches, or even use swap over NFS or NBD/DRBD. There are many ways and possible solutions, but the best way to deal with swapping is to make sure it doesn't happen in the first place... ;)
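To make the "dedicated swap disk per group of domUs" idea from the quote above a bit more concrete, here is a hedged sketch of how the disk line of a Xen domU config might look (volume group, device names and frontend names are assumptions):
# /etc/xen/example-domU.cfg: root filesystem on the RAID array,
# swap on a single dedicated disk shared only by this client's domUs.
disk = [ 'phy:/dev/vg0/example-root,xvda1,w',
         'phy:/dev/sdc1,xvda2,w' ]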

20 June 2008

MJ Ray: Firefox 3, day 3: first impressions

Previously, I wrote:
Seriously: the browser looks like a big improvement from Firefox 2, but there are so many niggles with this download day idea...
In reply to Open Sesame Did you download Firefox 3?, I answer "Yes". It was a major upgrade for me, requiring new versions of Cairo and GTK+2, and installation of DBus-GLib on my GoboLinux computer, which brought in new versions of Xorg and so required a recompile of my GNUstep desktop applications. Once that was done, Firefox compiled unattended.

As noted by Adam Sampson in the comments on my last post, even after building from source, you still get all the obnoxious click-through EULA, and when you type about:config into the address bar, you get a "no user-serviceable parts" sort of notice, which really sucks. I notice that MozCorp don't call it "100% Open Source", preferring instead Firefox: 100% Organic Software (because we need another marketing campaign for free software, right?), so I expect I need to winkle out the restrictively-licensed parts again - GNUzilla, there's still demand for your good work!

After day 3 with Firefox 3, what do I think of it? Well, it seems a lot faster and a lot less RAM-hungry, and I'm quite impressed that all of the fancier bits of Koha and Wordpress seem to be working nicely but while I'm not annoyed enough to switch browsers yet (unlike FF3 and Safari - DrBacchus' Journal), there are still a hell of a lot of niggles and interface bugs. Some of the problems may have been introduced in Firefox 2, but I didn't actually use that enough to notice. My day-to-day browsing for the last year or so has been on a customised Firefox 1.5.

The FF3 user interface has some big steps backwards from FF1.5: in particular, I've lost the "force pages that try to open new windows into the same window" option (or whatever it was called... I can't find the FF1.5 manual online anymore); some keyboard shortcuts have changed - for no good reason that I can see (JavaScript has switched from Alt-E n Alt-S to Alt-E n Alt-J, for example); what on earth is the history drop down doing next to the "Go Forward" arrow?; and the button to close a tab is on each tab, so I need to be careful to miss it when trying to switch to a tab and my pointer makes a pointless detour to the top-right when I want to close a tab. It's not all bad on the interface. The new RSS feed and bookmark links in the location bar are much better than in previous versions. The bookmark tagging and auto-generated folders could be a great idea once I've used it for a while.

I'm pretty annoyed that Firefox 3 seems to come with some spyware enabled by default. I usually have cookies either switched off or set to "ask me every time" so I was surprised to be offered a cookie from safebrowsing.google.com! I know it's for a noble goal, but what's this doing enabled without asking first? Untick the "tell me if the site I'm visiting is ..." options in Edit: Preferences: Security if you don't want details of your browsing to be sent to the USA. Another thing which really annoys me is that the Firefox support site requires javascript and seems unhappy with my cookie settings. Not cool.

Other than that, the main problems with Firefox 3 are omissions rather than bugs. For example, Microformats [Alex Faaborg] support was one of the long-trumpeted new features in Firefox 3, but they're really not obviously included, as noted by others in posts like Firefox 3 is here - where's the microformats? And finally, searching mozilla.com for firefox returns 0 hits, which is a bit strange... are they ashamed of it?

18 June 2008

Norbert Tretkowski: DRBD in Debian with Linux 2.6.25 and newer

The DRBD packages in Debian were a bit outdated until yesterday: neither the package in unstable (8.0 branch) nor the one in experimental (8.2 branch) could be built against Linux 2.6.25. Three bugs had been filed against the DRBD packages because of that problem, which had already been fixed in newer upstream releases.

After Philipp Hug accepted my offer to co-maintain the DRBD packages in Debian, I uploaded DRBD 8.0.12 to unstable and 8.2.6 to experimental. Both packages are working fine with Linux 2.6.25 and newer.

15 February 2008

Joerg Jaspert: PostgreSQL Continuous Archiving and Point-In-Time Recovery (PITR)

A nice feature of PostgreSQL is its “write ahead log” (WAL) files. Quoting from the PostgreSQL website, they describe every change made to the database’s data files. […] if the system crashes, the database can be restored to consistency by “replaying” the log entries made since the last checkpoint. This basically enables online backup of your database cluster. And as you end up with the complete log of all actions within your database(s), you are also able to go back in time to any point you like. I won’t write a complete howto about it, as the PostgreSQL docs are pretty good; these are just some notes on how I did my setup, as a reminder for when I have to do it again.
There must be a place to store the WAL files that is not on the same machine, otherwise it doesn’t make much sense. For me that means scp to a different host, but an NFS mount, a storage array or a drbd device might also be an option. Much of the following is for the scp method, but it is easily adaptable to other storage methods. I need a target user on the host I copy files to, and as I did the setup for the pentabarf database running on host skinner, I helpfully named it pentabarf-backup. The copy process can’t be interactive, so the postgres user on the database host needs a passwordless SSH key to log in as the pentabarf-backup user on the backup host. To limit my bad feeling about a “passwordless key to log in”, I restrict the key by adding
from="skinner.debconf.org,72.32.250.224,72.32.79.241",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty 
in front of it while copying it to the pentabarf-backup user’s .ssh/authorized_keys file. I need a few directories for the stuff to back up, namely pentabarf-WAL-skinner, pentabarf-tarballs and pg_xlog-skinner, created in the home of the pentabarf-backup user. The next step is to tell PostgreSQL that we want online backups. Add the following statements to your postgresql.conf, adapting the values to your needs:
archive_command = 'ssh backup test ! -f pentabarf-WAL-skinner/%f && rsync -az %p backup:pentabarf-WAL-skinner/%f'
checkpoint_timeout = 1h
archive_timeout = 12h
The above tells PostgreSQL that I want it to archive the files to my backup host with rsync (via ssh), but only if they do not exist there yet. It also tells it to archive the files at least once every 12 hours. The effect is that WAL files get copied over when they are “full” or after 12 hours have passed, whichever happens first. Don’t set it too low, as each file always has the full, fixed size and you would waste space transferring mostly-empty files. In case you are concerned about a possibly slow network link, the PostgreSQL site has the following to say: The speed of the archiving command is not important, so long as it can keep up with the average rate at which your server generates WAL data. Normal operation continues even if the archiving process falls a little behind. I haven’t had a problem with the speed yet; a 16MB file every (at least) 12 hours isn’t that bad. Now that we have done all the setup and are able to transfer WAL files, we need a base backup to start with. Issue a SELECT pg_start_backup(‘BASE_BACKUP’); as a database superuser and then back up the database cluster by backing up the whole cluster’s data/ directory (and every other location if you use tablespaces). When that is done, issue SELECT pg_stop_backup(); and your base backup is complete. You might want to exclude pg_xlog/ to save space; you don’t need it.
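For illustration, a minimal sketch of such a base backup, assuming the cluster lives in /var/lib/postgresql/8.2/main, the backup host is reachable via the ssh alias backup used above, and the tarball goes into the pentabarf-tarballs directory (the tarball name and exact paths are assumptions):
# mark the start of the base backup, tar up the cluster, mark the end
psql -U postgres -c "SELECT pg_start_backup('BASE_BACKUP');"
tar -czf /tmp/base-backup.tar.gz --exclude=pg_xlog -C /var/lib/postgresql/8.2 main
psql -U postgres -c "SELECT pg_stop_backup();"
# ship the tarball to the backup host
rsync -az /tmp/base-backup.tar.gz backup:pentabarf-tarballs/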
As I’m lazy I use a script (written by Peter ‘weasel’ Palfrader), you can get it here, but whatever you use is fine. If you use the script, just make sure to adapt its variables to values that fit your system. From now on you should do such base backups at regular intervals; I run them once a week, as recovery needs the last base backup plus all WAL files from then on. Doing regular base backups limits the number of WAL files you need to keep around and replay.
Reading the above you might notice that there is a window of up to 12 hours for which I do not have backups. True, and to make this window much smaller, I am using another script (also written by weasel), get it here, together with an init script. It synchronizes the pg_xlog/ directory to my backup host every 5 minutes, so in case of a server crash I can replay everything up to the last 5 minutes (a minimal sketch of such a sync loop follows the recovery steps below).
Recovery using WAL files
This is pretty simple, and also well documented on the PostgreSQL site. For my scenario it basically boils down to:
  1. Stop postgres
  2. Move away the cluster.
    mv /var/lib/postgresql/8.2 /var/lib/postgresql/8.2_old
  3. Restore the base backup, being careful to get the file ownership and permissions right. For the setup described above that means copying and untarring the latest weekly tarball.
  4. Recreate pg_xlog:
      mkdir /var/lib/postgresql/8.2/main/pg_xlog
      chown postgres:postgres /var/lib/postgresql/8.2/main/pg_xlog
      chmod 0700 /var/lib/postgresql/8.2/main/pg_xlog
      mkdir /var/lib/postgresql/8.2/main/pg_xlog/archive_status
      chown postgres:postgres /var/lib/postgresql/8.2/main/pg_xlog/archive_status
      chmod 0700 /var/lib/postgresql/8.2/main/pg_xlog/archive_status
     
  5. Copy possibly unarchived WAL files from /var/lib/postgresql/8.2_old/main/pg_xlog into the newly created pg_xlog.
  6. Copy all the needed WAL segment files over from the backup host. That should be all the files in the pentabarf-WAL-skinner/ directory on the backup host. You should be able to limit it to the files that are >= the latest *.backup file.
  7. Create a recovery command file recovery.conf in the cluster data directory. You may also want to temporarily modify pg_hba.conf to prevent ordinary users from connecting until you are sure the recovery has worked. The content of recovery.conf is
          restore_command = 'cp /tmp/pg-WAL/%f "%p"'
         
    Make sure that the file location you specify is the location you put the WAL segment files into in the prior step.
  8. Start the server; it will go into recovery mode and proceed to read through all the archived WAL files it needs.
  9. Look at the database and decide if you like what you see.
Of course the procedure can be slightly adapted so you end up with the database state from 2 weeks ago. Or 3 days. Or whatever you may need (and have WAL files for). There are many possibilities, and there is a lot of text on the PostgreSQL page that I don’t want to duplicate here, so please read it. Also, I’m not using this as the sole way to back up my database(s); they are all still dumped to SQL files every day. This is just additional.
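As mentioned above, here is a minimal sketch of the kind of pg_xlog sync loop that weasel’s script and init script implement (paths, host alias and interval are assumptions; use the real script for an actual setup):
#!/bin/sh
# copy current, possibly not yet archived, WAL segments to the backup host
# every 5 minutes, so a crash loses at most the last few minutes
while true; do
    rsync -az /var/lib/postgresql/8.2/main/pg_xlog/ backup:pg_xlog-skinner/
    sleep 300
done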

9 February 2008

John Goerzen: A Cloud Filesystem

A Slashdot question today about putting to use all the unused disk space on corporate desktops got me to thinking. Now, before I start, comments there raised valid points about performance, reliability, etc.

But let's say that we have a "cloud filesystem". This filesystem would, at its core, have one configurable parameter: how many copies of each block of data must exist in the cloud. Now, we add servers with disk space to the cloud. As we add servers, the amount of available space on the cloud increases, subject to having enough space for replication according to our parameters.

Then, say we want a minimum of 3 copies of each block replicated. Each write to the filesystem will then cause a write to at least 3 different servers. Now, what if one server goes down? If the cloud filesystem is short on space, we may be down to only 2 copies of some blocks until that server comes back up. Otherwise, space permitting, it can rebuild that third copy on other servers.

Now, has this been done before? As far as I can tell, no. Wouldn't it be sweet?

But there are some projects that are close. Most notably, GlusterFS. GlusterFS does all of the above, except the automated bits. You can have this 3-copy redundancy, but you have to manually tell it where each copy goes, manually reconfigure if a server goes offline, etc. Other options such as NBD, OpenAFS, GFS, DRBD, Lustre, etc. aren't really well-suited for this scenario for various reasons.

So, what does everyone think? Can this work? Has it been done outside of Google?
